Extend your service's observability from structured logs alone to metrics and traces. Use OpenTelemetry (OTel) as the unified SDK, protocol, and transport, pair it with a Collector, and let the Collector ship data to whichever backend you prefer (Prometheus, Tempo/Jaeger, Datadog, Cloud Monitoring, ...). Finally, tie the three signals together: logs carry a trace_id, metrics and traces cross-link, and incidents go from "gut feeling" to "evidence".
In a cloud-native setup you need all three signals at the same time: logs, metrics, and traces.
OpenTelemetry standardizes them: one SDK, one wire protocol (OTLP), one pipeline for shipping the data.
Two common topologies:
Direct: App (SDK) --OTLP--> backend (e.g. Tempo, or a Prometheus-compatible endpoint)
Via a Collector: App (SDK) --OTLP--> OTel Collector --export--> Tempo/Jaeger + Prometheus/OTLP + logs sink
The Collector route buys you centralized retries, buffering, filtering, sampling, and fan-out to multiple destinations, plus much more deployment flexibility.
Add the dependencies (as a dev or obs extra):
# pyproject.toml
[project.optional-dependencies]
obs = [
  "opentelemetry-sdk>=1.27",
  "opentelemetry-exporter-otlp>=1.27",
  "opentelemetry-instrumentation-fastapi>=0.49b0",
  "opentelemetry-instrumentation-httpx>=0.49b0",
  "opentelemetry-instrumentation-logging>=0.49b0",
  "opentelemetry-instrumentation-requests>=0.49b0",
  "opentelemetry-instrumentation-sqlalchemy>=0.49b0",
  "prometheus-client>=0.20"  # 若需要 /metrics(Pull 模式)
]
[tool.hatch.envs.obs]
features = ["obs"]
Point the SDK at your Collector (or backend) with environment variables:
export OTEL_SERVICE_NAME=awesome-api
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_RESOURCE_ATTRIBUTES=service.version=1.2.3,service.namespace=payments,env=dev
# sampling ratio (0 to 1)
export OTEL_TRACES_SAMPLER=traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.2
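If you'd rather pin sampling in code than rely on the two environment variables above, the SDK also accepts a sampler object directly. A minimal sketch (the 0.2 mirrors OTEL_TRACES_SAMPLER_ARG):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# ParentBased keeps children consistent with the parent's sampling decision;
# TraceIdRatioBased samples roughly 20% of new root traces.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.2)))
```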
# src/my_project/obs/telemetry.py
from opentelemetry import trace, metrics
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
import os
def setup_otel() -> None:
    res = Resource.create({
        "service.name": os.getenv("OTEL_SERVICE_NAME", "awesome-api"),
        "service.version": os.getenv("SERVICE_VERSION", "0.0.0"),
        "service.namespace": os.getenv("SERVICE_NAMESPACE", "default"),
        "deployment.environment": os.getenv("ENV", "dev"),
    })
    # Traces
    tp = TracerProvider(resource=res)
    tp.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))  # OTLP gRPC 4317
    trace.set_tracer_provider(tp)
    # Metrics
    reader = PeriodicExportingMetricReader(OTLPMetricExporter())   # OTLP gRPC 4317
    mp = MeterProvider(resource=res, metric_readers=[reader])
    metrics.set_meter_provider(mp)
# src/my_project/adapters/web/app.py
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from my_project.obs.telemetry import setup_otel
def create_app() -> FastAPI:
    setup_otel()
    app = FastAPI(title="Awesome API", version="1.2.3")
    # Auto-instrument the framework and client libraries
    FastAPIInstrumentor.instrument_app(app)
    HTTPXClientInstrumentor().instrument()
    RequestsInstrumentor().instrument()
    # SQLAlchemyInstrumentor().instrument(engine=your_engine)  # if you use SQLAlchemy
    @app.get("/healthz")
    def healthz(): return {"ok": True}
    return app
app = create_app()
Add manual spans around business logic:
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
def compute_quote(user_id: str, items: list[dict]) -> int:
    with tracer.start_as_current_span("compute_quote") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("items.count", len(items))
        # ... heavy logic ...
        price = 123
        span.add_event("quote_computed", {"price": price})
        return price
A worked example: HTTP business success rate plus a latency histogram.
from opentelemetry import metrics
meter = metrics.get_meter(__name__)
orders_success = meter.create_counter("orders_success_total")
orders_failed  = meter.create_counter("orders_failed_total")
latency = meter.create_histogram("orders_latency_ms", unit="ms")
import time

def place_order(user_id: str, payload: dict) -> str:
    t0 = time.perf_counter()
    try:
        # ... business ...
        orders_success.add(1, {"route": "POST /v1/orders"})
        return "ok"
    except Exception:
        orders_failed.add(1, {"route": "POST /v1/orders"})
        raise
    finally:
        latency.record((time.perf_counter() - t0) * 1000, {"route": "POST /v1/orders"})
Getting started with SLOs:
Apply RED (Requests, Errors, Duration) to every API.
Apply USE (Utilization, Saturation, Errors) to resources (CPU, connection pools, queues).
Always attach labels such as route, status_class, env, and service.version to your metrics; without them, alerts and per-segment breakdowns are meaningless (a small sketch follows below).
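A minimal sketch of what "always attach the base labels" can look like in code. The helper names here (BASE_ATTRS, record_http_result) are illustrative, not part of any library:

```python
import os
from opentelemetry import metrics

# Service-wide labels, merged into every measurement (illustrative helper).
BASE_ATTRS = {
    "env": os.getenv("ENV", "dev"),
    "service.version": os.getenv("SERVICE_VERSION", "0.0.0"),
}

meter = metrics.get_meter(__name__)
http_requests = meter.create_counter("http_requests_total")

def record_http_result(route: str, status_code: int) -> None:
    # status_class buckets 200/201/... into "2xx"/"4xx"/"5xx" so alerts can group on it.
    attrs = {**BASE_ATTRS, "route": route, "status_class": f"{status_code // 100}xx"}
    http_requests.add(1, attrs)

# Example: record_http_result("POST /v1/orders", 201)
```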
Building on the JSON logs from earlier, inject the current trace context into every line:
# src/my_project/logging_config.py
import logging, structlog
from opentelemetry.trace import get_current_span
def _otel_ids(_, __, event_dict):
    span = get_current_span()
    ctx = span.get_span_context()
    if ctx and ctx.is_valid:
        event_dict["trace_id"] = f"{ctx.trace_id:032x}"
        event_dict["span_id"]  = f"{ctx.span_id:016x}"
    return event_dict
def setup_logging():
    logging.basicConfig(level=logging.INFO)
    structlog.configure(
        processors=[
            structlog.processors.add_log_level,
            _otel_ids,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.JSONRenderer(),
        ]
    )
With that in place, when you find an error log you can take its trace_id and jump straight to the full distributed trace.
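A minimal end-to-end sketch of the wiring order and of logging inside a span (it reuses the setup_otel and setup_logging functions defined above; the event names are made up):

```python
import structlog
from opentelemetry import trace

from my_project.obs.telemetry import setup_otel
from my_project.logging_config import setup_logging

setup_otel()      # 1) providers first, so spans exist
setup_logging()   # 2) then structlog, including the _otel_ids processor

log = structlog.get_logger()
tracer = trace.get_tracer(__name__)

# Logs emitted inside an active span pick up trace_id / span_id automatically.
with tracer.start_as_current_span("demo-request"):
    log.info("order_created", order_id="o-123")
```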
otel-collector.yaml (receives OTLP, exports traces to Tempo and metrics for Prometheus; illustrative only):
receivers:
  otlp:
    protocols:
      grpc:
      http:
exporters:
  otlp:
    endpoint: tempo:4317  # Tempo or Jaeger both work
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:9464"  # 由 Prometheus 來拉 metrics(亦可 remote_write)
processors:
  batch: {}
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
# docker-compose.yml
version: "3.9"
services:
  otel-collector:
    image: otel/opentelemetry-collector:latest
    command: ["--config=/etc/otel.yaml"]
    volumes:
      - ./otel-collector.yaml:/etc/otel.yaml:ro
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "9464:9464"   # Prometheus metrics exporter
  tempo:
    image: grafana/tempo:latest
    ports: ["3200:3200"]  # Tempo query
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
prometheus.yml (scrapes the Collector's /metrics):
global: { scrape_interval: 15s }
scrape_configs:
  - job_name: "otelcol"
    static_configs: [{ targets: ["otel-collector:9464"] }]
Hatch scripts to spin up the stack and serve the app:
[tool.hatch.envs.obs.scripts]
up = [
  "docker compose up -d otel-collector tempo prometheus grafana",
  "python -c \"print('otel up')\""
]
down = "docker compose down"
serve = "uvicorn my_project.adapters.web.app:app --host 0.0.0.0 --port 8000"
Going to production (on Kubernetes):
Get OTEL_EXPORTER_OTLP_ENDPOINT configured, and put deployment.environment=prod into the resource attributes.
Keep the readinessProbe limited to lightweight dependencies; set terminationGracePeriodSeconds to at least 10s so the BatchSpanProcessor can flush.
Tune the server's graceful-timeout so in-flight spans are not lost on shutdown.
Build P95/P99 alert thresholds on the http.server.request.duration histogram, and page when error_rate exceeds your threshold.
Also watch outbound client.duration and client.error_rate, plus cache hit rate and DB connection-pool saturation.
A minimal shutdown-flush sketch follows the troubleshooting table below.

Troubleshooting:

| Symptom | Likely cause | Fix |
|---|---|---|
| Some spans missing | Sampling ratio too low; batch not flushed | Temporarily set OTEL_TRACES_SAMPLER=always_on; call force_flush() before shutdown; tune OTEL_BSP_SCHEDULE_DELAY (shorter means more frequent exports) |
| Metrics not updating | Only counters were recorded, histograms forgotten | Use histograms for latency and payload sizes; check the reader/exporter interval |
| Backend returns 429 or drops data | Collector has no batching/retry | Enable the batch processor and the exporter retry/queue settings in the Collector |
| Logs have no trace_id | Wrong setup order, or the log was emitted outside a span | Make sure setup_otel() runs before setup_logging(); log inside the request/span context |
| Nothing connects end to end | A proxy or firewall blocks 4317/4318 | Use OTLP over HTTP on 4318, or run a Collector inside the network and let it handle egress |
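As the notes and the first table row point out, spans buffered in the BatchSpanProcessor are easy to lose at shutdown. A minimal sketch of flushing on exit via a FastAPI lifespan handler, as an alternative to calling setup_otel() inside create_app() (assumes the SDK providers set up earlier):

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI
from opentelemetry import metrics, trace

from my_project.obs.telemetry import setup_otel

@asynccontextmanager
async def lifespan(app: FastAPI):
    setup_otel()  # startup: install tracer/meter providers
    yield
    # shutdown: flush buffered spans/metrics, then close the SDK providers
    trace.get_tracer_provider().force_flush()
    metrics.get_meter_provider().force_flush()
    trace.get_tracer_provider().shutdown()
    metrics.get_meter_provider().shutdown()

app = FastAPI(title="Awesome API", version="1.2.3", lifespan=lifespan)
```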
At this point your service doesn't just run, it can be seen: logs have context, traces form a chain, metrics have distributions, and the Collector ships everything reliably. Production issues become less guesswork and more evidence. The next time latency climbs, you will open the latency histogram and the traces for the affected route before deciding whether to roll back or scale out, instead of chanting at your screen. Read this together with the earlier posts and the picture is complete.